AD-A253  187 


REPORT DOCUMENTATION PAGE


Form Approved
OMB No. 0704-0188


1. AGENCY USE ONLY (Leave blank)    2. REPORT DATE


4. TITLE AND SUBTITLE

"THEORY  OF  NEURAL  NETWORKS"  (U) 


6. AUTHOR(S)

Dr.  Yaser  S.  Abu-Mostafa 




3.  REPORT  TYPE  AND  DATES  COVERED 

FINAL  Aug  88  -  31  Jul  91 


5.  FUNDING  NUMBERS 

61102F 


2305/K5 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
California Institute of Technology
Electrical Eng/Computer Science
Pasadena, CA 91125


8. PERFORMING ORGANIZATION
REPORT NUMBER


AFOSR-TR-


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)


AFOSR/NM
Bldg  410 

Bolling  AFB  DC  20332-6448 


DTIC 


10. SPONSORING/MONITORING
AGENCY  REPORT  NUMBER 


AFOSR-88-0213 


11.  SUPPLEMENTARY  NOTES 


12a. DISTRIBUTION / AVAILABILITY STATEMENT

Approved  for  public  release; 
Distribution  unlimited 


12b.  DISTRIBUTION  CODE 


13. ABSTRACT (Maximum 200 words)

A new neural unsupervised learning technique has been proposed in this work.
The technique is based on the hierarchical partition of the patterns. Each
partition corresponds to one neuron, which is in general a higher-order
neuron. The partition is performed by iterating the neuron weights in an
attempt to maximize a defined criterion function. The method is implemented
on several examples and is found to give good results. In the second
implemented example the method obtained a good solution, whereas the
traditional adaptive resonance method and self-organizing maps produced
unsatisfactory results. The method is fast, as it typically takes about
2 to 5 iterations to converge. Although the proposed method is prone to getting
stuck in local minima, this happened in the simulations only in very
difficult problems, and the problem could be solved by using gradient
algorithms that search for the global maximum, such as the Tunneling Algorithm.


15.  NUMBER  OF  PAGES 

27 


16. PRICE CODE


17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT


Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18


Final Technical Report


THEORY  OF  NEURAL  NETWORKS 


Yaser  S.  Abu-Mostafa  and  Amir  F.  Atiya 


Grant  AFOSR  88-0213 


submitted  to: 

Dr.  Steven  Suddarth 
Air  Force  Office  of  Scientific  Research 
Bolling  Air  Force  Base 
Washington,  DC  20332 


Principal  Investigator 


Yaser  S.  Abu-Mostafa 

Departments  of  Electrical  Engineering  and  Computer  Science 
California  Institute  of  Technology 
Pasadena,  California  91125 



92-19948 



TABLE  OF  CONTENTS 


I   THE VAPNIK-CHERVONENKIS DIMENSION: INFORMATION VERSUS
    COMPLEXITY IN LEARNING

    I.1  Introduction
    I.2  Generalization
    I.3  The V-C dimension
    I.4  Interpretation

II  LEARNING FROM HINTS IN NEURAL NETWORKS

    II.1  Introduction
    II.2  Invariance hints
    II.3  Complexity issues
    II.4  Conclusion

III AN UNSUPERVISED LEARNING TECHNIQUE FOR ARTIFICIAL
    NEURAL NETWORKS

    III.1  Introduction
    III.2  The model
    III.3  Extensions
    III.4  Implementation Examples
    III.5  Conclusions



I 


THE VAPNIK-CHERVONENKIS DIMENSION:
INFORMATION VERSUS COMPLEXITY IN LEARNING

I.1 INTRODUCTION

We start by formalizing a simple setup for learning from examples. We have an
environment, such as the set of visual images, and we call the set $X$. In this
environment we have a concept defined as a function $f : X \to \{0,1\}$, such as the
presence or absence of a tree in the image. The goal of learning is to produce a
hypothesis, also defined as a function $g : X \to \{0,1\}$, that approximates the concept
$f$, such as a pattern recognition system that recognizes trees. To do this, we are given a
number of examples $(x_1, f(x_1)), \ldots, (x_N, f(x_N))$ from the concept, such as images
with trees and images without trees.

In generating the examples, we assume that there is an unknown probability
distribution $P$ on the environment $X$. We pick each example independently according to
this probability distribution. The statements in the paper hold true for any probability
distribution $P$, which sounds very strong indeed. The catch is that the same $P$ that
generated the examples is the one that is used to test the system, which is a plausible
assumption. Thus we learn the tree concept by being exposed to 'typical' images. While $X$
can be finite or infinite (countable or uncountable), we shall use a simple language that
assumes no measure-theoretic complications.

The hypothesis $g$ that we produce approximates $f$ in the sense that $g$ would rarely
be significantly different from $f$ [7]. This definition allows for two tolerance
parameters, $\epsilon$ and $\delta$. With probability $\geq 1 - \delta$, $g$ will differ
from $f$ at most $\epsilon$ of the time. The $\delta$ parameter protects against the small,
but nonzero, chance that the examples happen to be very atypical.

A  learning  algorithm  is  one  that  takes  the  examples  and  produces  the  hypothesis.  The 
performance  is  measured  by  the  number  of  examples  needed  to  produce  a  good  hypothesis 
as  well  as  the  running  time  of  the  algorithm. 

I.2 GENERALIZATION


We start with a simple case that may look at first as having little to do with what we
think of as generalization. Suppose we make a blind guess of a hypothesis $g$ without even
looking at any examples of the concept $f$. Now we take some examples of $f$ and test $g$
to find out how well it approximates $f$. Under what conditions does the behavior of $g$ on
the examples reflect its behavior in general?

This turns out to be a very simple question. On any point in $X$, $f$ and $g$ either
agree or disagree. Define the agreement set

$$A = \{x \in X : f(x) = g(x)\}.$$

The question now becomes: How does the frequency of the examples in $A$ relate to the
probability of $A$? Let $\pi$ be the probability of $A$, i.e., the probability that
$f(x) = g(x)$ on a point $x$ picked from $X$ according to the probability distribution $P$.
We can consider each example as a Bernoulli trial (coin flip) with probability $\pi$ of
success ($f = g$) and probability $1 - \pi$ of failure ($f \neq g$).

With $N$ examples, we have $N$ independent, identically distributed Bernoulli trials.
Let $n$ be the number of successes ($n$ is a random variable), and let $\nu = n/N$ be the
frequency of success. Bernoulli's theorem states that, by taking $N$ sufficiently large,
$\nu$ can be made arbitrarily close to $\pi$ with very high probability. In other words, if
you take enough examples, the frequency of success will be a good estimate of the
probability of success.

Notice that this does not say anything about the probability of success itself, but rather
about how the probability of success can be estimated from the frequency of success. If on
the examples we get 90% right, we should get about 90% right overall. If we get only 10%
right, we should continue to get about the same. We are only predicting that the results
of the experiment with the examples will persist, provided there are enough examples.
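
The blind-guess case can be illustrated with a small simulation. This is only a sketch
under stated assumptions: the environment is taken to be the interval [0, 1) with a uniform
distribution P, and the names `concept`, `blind_guess`, and `agreement_frequency` are
hypothetical choices that do not appear in the report. The guess agrees with the concept
with probability 0.7, and the observed frequency of agreement should settle near that value
as the number of examples grows.

```python
import random


def agreement_frequency(f, g, points):
    """Fraction of the sample points on which hypothesis g agrees with concept f."""
    agree = sum(1 for x in points if f(x) == g(x))
    return agree / len(points)


# Hypothetical concept on X = [0, 1): the "tree" is present when x < 0.7.
def concept(x):
    return 1 if x < 0.7 else 0


# Blind guess made without looking at any example: always answer 1.
# It agrees with the concept exactly when x < 0.7, so pi = 0.7 under uniform P.
def blind_guess(x):
    return 1


rng = random.Random(0)  # fixed seed so that the run is repeatable
for n_examples in (10, 100, 10_000):
    sample = [rng.random() for _ in range(n_examples)]
    nu = agreement_frequency(concept, blind_guess, sample)
    print(n_examples, nu)
```

The printed frequencies estimate how well the guess does; they say nothing about whether
0.7 is a good or a bad level of agreement.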

How  does  this  case  relate  to  learning  and  generalization?  After  all,  we  do  not  make  a 
blind  guess  when  we  learn,  but  rather  construct  a  hypothesis  from  the  examples.  However, 
at  a  closer  look,  we  find  that  we  make  a  guess,  not  of  a  hypothesis  but  of  a  set  of  hypotheses. 
For  example,  when  the  backpropagation  algorithm  [6]  is  used  in  a  feedforward  network,  we 
are  implicitly  guessing  that  there  is  a  good  hypothesis  among  those  that  are  obtained  by 
setting  the  weights  of  the  given  network  in  some  fashion.  The  set  of  hypotheses  G  would 
then  be  the  set  of  all  functions  g  that  are  obtained  by  setting  the  weights  of  the  network 
in  any  fashion. 

Therefore, when learning deals with a limited domain of representation, such as a
given network with free weights, we in effect make a guess $G$ of hypotheses. The learning
algorithm then picks a hypothesis $g \in G$ that mostly agrees with $f$ on the examples. The
question of generalization now becomes: Does this choice, which is based on the behavior
on the examples, hold in general?

We can approach this question in a similar way to the previous case. We define, for
every $g \in G$, the agreement set

$$A_g = \{x \in X \mid f(x) = g(x)\}.$$

These sets are different for different $g$'s. Let $\pi_g$ be the probability of $A_g$,
i.e., the probability that $f(x) = g(x)$ on a point $x$ picked from $X$ according to the
probability distribution $P$, for the particular $g \in G$ in question. We can again define
random variables $n_g$ (the number of successes with respect to different $g$'s) and the
frequencies of success $\nu_g = n_g/N$. At this point the problem looks exactly the same as
the previous one and one may expect the same answer.

There is one important difference. In the simple Bernoulli case, the issue was whether
$\nu$ converged to $\pi$. In the new case, the issue is whether the $\nu_g$'s converge to
the $\pi_g$'s in a uniform manner as $N$ becomes large. In the learning process, we decide
on one $g$ but not the other based on the values of $\nu_g$. If we had the $\nu_g$'s
converge to the $\pi_g$'s, but not in a uniform manner, we could be fooled by one erratic
$g$. For example, we may be picking the hypothesis $g$ with the maximum $\nu_g$. With
nonuniform convergence, the $g$ we pick can have a poor $\pi_g$. We want the probability
that there is some $g \in G$ such that $\nu_g$ differs significantly from $\pi_g$ to be very
small. This can be expressed formally as

$$\Pr\left[\sup_{g \in G} |\nu_g - \pi_g| > \epsilon\right] \leq \delta,$$

where 'sup' denotes the supremum.
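
The danger of picking the hypothesis with the largest frequency of success can be seen in a
toy experiment. The sketch below rests on a simplifying assumption that is not in the
report: every hypothesis is pure chance (its probability of success is 0.5) and the
agreement of each hypothesis on each example is drawn independently. The function name
`best_sample_frequency` is hypothetical.

```python
import random


def best_sample_frequency(num_hypotheses, n_examples, rng):
    """Largest frequency of success over hypotheses that all have probability 0.5."""
    best = 0.0
    for _ in range(num_hypotheses):
        # Each example is an independent coin flip for this hypothesis.
        successes = sum(rng.random() < 0.5 for _ in range(n_examples))
        best = max(best, successes / n_examples)
    return best


rng = random.Random(0)  # fixed seed so that the run is repeatable
print(best_sample_frequency(1000, 20, rng))    # few examples: the winner looks good
print(best_sample_frequency(1000, 2000, rng))  # many examples: the winner is near 0.5
```

With few examples, the best of many chance hypotheses looks far better on the sample than
it really is, which is the kind of deviation the supremum in the condition above rules out.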

I.3 THE V-C DIMENSION

A condition for uniform convergence, hence generalization, was found by Vapnik and
Chervonenkis [8]. The key is the inequality

$$\Pr\left[\sup_{g \in G} |\nu_g - \pi_g| > \epsilon\right] \leq 4\,m(2N)\,e^{-\epsilon^2 N/8},$$

where $m$ is a function that depends on $G$. We want the RHS of the inequality to be small
for large $N$, in order to achieve uniform convergence. The factor $e^{-\epsilon^2 N/8}$ is
very helpful, since it is exponentially decaying in $N$. Unless the factor $m(2N)$ grows too
fast, we should be OK. For example, if $m(2N)$ is polynomial in $N$, the RHS will go to zero
as $N$ goes to infinity.
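
To get a feel for how the exponential factor eventually wins over a polynomial $m$, the
right-hand side can be evaluated numerically. The following is a sketch, assuming the bound
has the form $4\,m(2N)\,e^{-\epsilon^2 N/8}$ and using a hypothetical cubic growth function;
the names `vc_bound`, `sufficient_examples`, and `cubic_growth` are illustrative only.

```python
import math


def vc_bound(n_examples, epsilon, growth):
    """Right-hand side 4 * m(2N) * exp(-epsilon^2 * N / 8) for a growth function m."""
    return 4.0 * growth(2 * n_examples) * math.exp(-(epsilon ** 2) * n_examples / 8.0)


def sufficient_examples(epsilon, delta, growth, limit=2 ** 40):
    """First power of two N at which the bound drops to delta or below.

    This is a sufficient number of examples, not necessarily the smallest one.
    Returns None if no such N is found up to `limit`.
    """
    n = 1
    while n <= limit:
        if vc_bound(n, epsilon, growth) <= delta:
            return n
        n *= 2
    return None


def cubic_growth(n):
    """Hypothetical polynomial growth function m(N) = N^3 + 1."""
    return n ** 3 + 1


print(sufficient_examples(0.1, 0.05, cubic_growth))
```

The search doubles N rather than stepping by one because the bound first rises with N
before the exponential decay takes over, so only a coarse answer is sought here.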

What is the function $m$? It depends on the set of hypotheses $G$. Intuitively, $m(N)$
measures the flexibility of $G$ in expressing an arbitrary concept on $N$ examples. For
instance, if $G$ contains enough hypotheses to be able to express any concept on 100
examples, one should not really expect any generalization with only 100 examples, but
rather a memorization of the concept on the examples. On the other hand, if gradually more
and more concepts cannot be expressed by any hypothesis in $G$ as $N$ grows, then the
agreement on the examples means something, and generalization is probable. Formally, $m(N)$
measures the maximum number of different binary functions on the examples
$x_1, \ldots, x_N$ induced by the hypotheses $g_1, g_2, \ldots \in G$.

For example, if $X$ is the real line and $G$ is the set of rays of the form $x < a$, i.e.,
functions of the form

$$g(x) = \begin{cases} 1, & x < a \\ 0, & x \geq a, \end{cases}$$

then $m(N) = N + 1$. The reason is that on $N$ points one can define only $N + 1$ different
functions of the above form by sliding the value of $a$ from left of the leftmost point all
the way to right of the rightmost point.
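
The count $N + 1$ for rays can be checked by brute force. The sketch below assumes the
points are distinct real numbers and that a ray labels a point 1 when it lies to the left
of $a$ and 0 otherwise; the function name `ray_dichotomies` is hypothetical.

```python
def ray_dichotomies(points):
    """All distinct labelings of `points` induced by rays g_a(x) = 1 if x < a else 0.

    Assumes the points are distinct real numbers.
    """
    pts = sorted(points)
    # One threshold left of every point, one between each adjacent pair,
    # and one right of every point: sliding a through these covers all cases.
    thresholds = [pts[0] - 1.0]
    thresholds += [(lo + hi) / 2.0 for lo, hi in zip(pts, pts[1:])]
    thresholds.append(pts[-1] + 1.0)
    return {tuple(1 if x < a else 0 for x in points) for a in thresholds}


sample = [0.3, 1.5, 2.0, 7.0]
print(len(ray_dichotomies(sample)))  # N + 1 labelings for N = 4 points
```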

There are two simple facts about the function $m$. First, $m(N) \leq |G|$ (where $|\cdot|$
denotes the cardinality), since $G$ cannot induce more functions than it has. This fact is
useful only when $G$ is a finite set of hypotheses. The second fact is that
$m(N) \leq 2^N$, since $G$ cannot induce more binary functions than there are on $N$
points. Indeed, there are choices of $G$ (trivially the set of all hypotheses on $X$) for
which $m(N) = 2^N$. For those cases, the V-C inequality does not guarantee uniform
convergence.

The main fact about $m(N)$ that helps the characterization of $G$ as far as generalization
is concerned is that $m(N)$ is either identically equal to $2^N$ for all $N$, or else is
bounded above by $N^d + 1$ for a constant $d$. This striking fact can be proved in a simple
manner [4,8]. The latter case implies a polynomial $m(N)$ and guarantees generalization. The
value of $d$ matters only in how fast convergence is achieved. This is of practical
importance because this determines the number of examples needed to guarantee
generalization within given tolerance parameters.

The value of $d$ turns out to be the smallest $N$ at which $G$ starts failing to induce all
possible $2^N$ binary functions on any $N$ examples. Thus, the former case can be considered
the case $d = \infty$. The parameter $d$ is called the V-C dimension [2,3].
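
Under this definition, $d$ can be read off a growth function by looking for the first $N$
at which it falls below $2^N$. The sketch below is illustrative only: the name
`break_point` and the search cutoff `max_n` are assumptions, and a return value of `None`
stands in for the case $d = \infty$ only up to that cutoff.

```python
def break_point(growth, max_n=64):
    """Smallest N with growth(N) < 2**N, or None if none is found up to max_n."""
    for n in range(1, max_n + 1):
        if growth(n) < 2 ** n:
            return n
    return None


def rays(n):
    """Growth function of rays on the real line: m(N) = N + 1."""
    return n + 1


def all_functions(n):
    """Growth function of the set of all hypotheses: m(N) = 2^N."""
    return 2 ** n


print(break_point(rays))           # rays already fail on 2 points
print(break_point(all_functions))  # never fails within the cutoff
```

For rays the search returns 2, and the polynomial bound $N^2 + 1$ indeed dominates
$N + 1$ for every $N \geq 1$.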

I.4 INTERPRETATION


Training a network with a set of examples can be thought of as a process for selecting
a hypothesis $g$ with a favorable performance on the examples (large $\nu_g$) from the set
$G$. Depending on the characteristics of $G$, one can predict how this performance will
generalize. This aspect of the characteristics of $G$ is captured by the parameter $d$, the
V-C dimension. If the number of examples $N$ is large enough with respect to $d$,
generalization is expected. This means that maximizing $\nu_g$ will approximately maximize
$\pi_g$, the real indicator of how well the hypothesis approximates the concept.

In general, the more flexible (expressive, large) $G$ is, the larger its V-C dimension $d$.
For example, the V-C dimension of feedforward networks grows with the network size [2].
In particular, the total number of weights in a one-hidden-layer network is an approximate
lower bound for the V-C dimension of the network. While a bigger network stands a
better chance of being able to implement a given function, its demand on the number
of examples needed for generalization is greater. These are often conflicting criteria. The
V-C dimension indicates only the likelihood of generalization, that is, whether the
behavior on the examples is going to persist, for better or for worse. The ability of the
network to approximate a given function in principle is a separate issue.

The running time of the learning algorithm is a key concern [5,7]. As the number of
examples increases, the running time generally increases. However, this dependency is a
minor one. Even with few examples, an algorithm may need an excessive amount of time
to manipulate the examples into a hypothesis. The independence of this complexity issue
from the above discussion regarding information is apparent. Without a sufficient number
of examples, no algorithm, slow or fast, can produce a good hypothesis. Yet a sufficient
number of examples is of little use if the computational task of digesting the examples into
a hypothesis proves intractable.

II


LEARNING  FROM  HINTS 
IN  NEURAL  NETWORKS 


II.1 INTRODUCTION

We can think of learning from examples as one end of a spectrum whose other end
is explicit programming. Between these two extremes, there is a spectrum of largely
unexplored possibilities.

To explain what we mean, let us assume that we have a decision function
$f : X \to \{0,1\}$ that we wish to implement. One instance is the primality problem, where
$X = \{1, 2, 3, \ldots\}$ and $f(x) = 1$ iff $x$ is a prime. Another is the problem of
recognizing a tree in an image, where $X$ is a set of images and $f(x) = 1$ iff $x$ contains
a tree. The goal is to come up with an implementation of $f$. We can write a simple program
for $f$ in the primality problem. However, the lack of a mathematical understanding of $f$
in the image recognition problem forces us to seek other approaches. One approach is
learning from examples, where we use a 'learning process' that we present with examples of
images with trees and images without trees until the process infers an implementation of
$f$. Whenever we have an effective process for learning from examples, it is tantamount to
automated programming. The process is a mechanical means of producing an implementation of
$f$.

When feasible, learning from examples is a very convenient approach. It does not
require any knowledge of $f$, just input-output examples. In many practical situations, we
do have some knowledge of $f$. In these cases, it would be inefficient to take blind
examples without taking advantage of what we already know about $f$. This gives rise to
learning from hints as opposed to learning from examples. Learning from hints is still a
learning process, since we do not know enough about $f$ to program it outright.

A hint is any piece of information about $f$. As a matter of fact, an input-output
example is a special case of a hint. A hint may take the form of a global constraint on $f$,
such as a symmetry property or an invariance. It may also be partial information about
the implementation of $f$.

A hint may be valuable to the learning process in two ways. It may reduce the number
of functions that are candidates to be $f$ (information value), and may reduce the number of
steps needed to find the implementation of $f$ (complexity value). For illustration, suppose
we are learning about an unknown integer $N$ ($10^5 \leq N < 10^6$) and we want to represent
it as a six-digit number. Which is a more valuable hint: 'The number is a prime' or 'The
most significant digit is 7'? Although the first hint has more information value, it may
have less complexity value because it does not reduce the search space in a way that is
easily compatible with the desired representation.
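
The information value of the two hints can be compared by counting how many six-digit
candidates each one leaves. The sketch below is a plain sieve count; the names
`primes_below`, `prime_candidates`, and `leading_seven` are hypothetical.

```python
def primes_below(limit):
    """Sieve of Eratosthenes: list of all primes p < limit."""
    is_prime = bytearray([1]) * limit
    is_prime[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Cross out p*p, p*p + p, ... (smaller multiples are already crossed out).
            is_prime[p * p :: p] = bytearray(len(range(p * p, limit, p)))
    return [n for n in range(limit) if is_prime[n]]


# Candidates left by the hint "the number is a prime".
prime_candidates = sum(1 for p in primes_below(10 ** 6) if p >= 10 ** 5)

# Candidates left by the hint "the most significant digit is 7".
leading_seven = sum(1 for n in range(10 ** 5, 10 ** 6) if str(n)[0] == "7")

print(prime_candidates, leading_seven)
```

The prime hint leaves fewer candidates, so it carries more information. The digit hint
still pins down one digit of the desired representation directly, which is the sense in
which it fits the six-digit form more easily.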

We shall report a positive result and a negative result in this paper. The positive result
is a technique that incorporates any invariance hint in any descent technique for learning.
The negative result is that general learning in neural networks remains NP-complete even
with a hint that is biologically plausible. This paper provides a preliminary treatment of
the subject of hints, and many directions of investigation warrant further exploration.

For completeness, we briefly introduce feedforward neural networks and their gradient
descent learning. A single-output feedforward neural network (Figure II.1) is a
combinational circuit organized in layers of units (neurons). Each neuron performs a
threshold function $\theta(\sum_j w_j u_j - t)$, where $\{u_j\}$ are the inputs to the
neuron, $\{w_j\}$ are real numbers (weights), $t$ is a real number (threshold), and $\theta$
is a 'sigmoid' function (soft or hard) that varies between $-1$ and $+1$ monotonically (the
binary convention is $\pm 1$ instead of $0, 1$).

For any set of values for the weights and the thresholds, the network output $y$ is a fixed
function of the input variables $x = x_1 \ldots x_N$. In order to make the network implement
a function $f(x)$, we need to choose the weights and the thresholds such that the actual
output $y$ and the desired output $f(x)$ are as close as possible. A gradient descent method
for learning $f$ from examples minimizes $(y - f(x))^2$ for each example by perturbing the
weights and thresholds. The formula for perturbing the weights is

$$\Delta w \propto -\frac{\partial (y - f(x))^2}{\partial w} = -2\,(y - f(x))\,\frac{\partial y}{\partial w}.$$

When this formula is applied to the neurons of a feedforward network, the resulting rule is
the backpropagation algorithm of Werbos, described in (Rumelhart, Hinton, and Williams,
1986).
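
For a single neuron, one step of this descent can be written out directly. The sketch
below makes several assumptions that are not in the report: the sigmoid is taken to be
`tanh`, the step size `rate` is an arbitrary choice, and the names `neuron_output` and
`example_step` are hypothetical.

```python
import math


def neuron_output(weights, threshold, inputs):
    """Soft-threshold neuron: tanh(sum_j w_j * u_j - t), with values in (-1, +1)."""
    return math.tanh(sum(w * u for w, u in zip(weights, inputs)) - threshold)


def example_step(weights, threshold, inputs, target, rate=0.1):
    """One gradient descent step on (y - target)^2; returns updated (weights, threshold)."""
    y = neuron_output(weights, threshold, inputs)
    # d/dw_j (y - target)^2 = 2 (y - target) (1 - y^2) u_j, since tanh' = 1 - tanh^2.
    common = 2.0 * (y - target) * (1.0 - y * y)
    new_weights = [w - rate * common * u for w, u in zip(weights, inputs)]
    # The threshold enters with a minus sign, so its update has the opposite sign.
    new_threshold = threshold + rate * common
    return new_weights, new_threshold


weights, threshold = [0.2, -0.1, 0.05], 0.0
x, fx = [1, -1, 1], 1  # one example (x, f(x)) in the +/-1 convention
for _ in range(50):
    weights, threshold = example_step(weights, threshold, x, fx)
print(neuron_output(weights, threshold, x))  # moves toward the desired output +1
```

In a multilayer network the same derivative is propagated back through the layers; only
the single-neuron case is shown here.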

II.2  INVARIANCE  HINTS 

Many  of  the  hints  we  have  in  pattern  recognition  problems  are  invariance  hints.  A 
hand  written  letter  S  can  be  deformed  in  many  ways  without  losing  its  identity.  Properties 
such  as  shift  invariance,  scale  invariance,  and  rotation  invariance  are  commonplace  in  image 
recognition. 

An invariance property can be formalized as a set $\mathcal{A}$ of subsets $A \subset X$
such that if $x_1, x_2 \in A$, then $f(x_1) = f(x_2)$. Thus $f$ is invariant within each set
$A \in \mathcal{A}$. Interesting invariances are usually common to many functions $f$ on the
same domain $X$. For example, if $X$ is the set of images, many recognition functions share
some form of scale invariance. In this case, each set $A \in \mathcal{A}$ consists of images
which are scaled versions of one another.

How can the invariance hint help the learning process? One way is to incorporate
the hint directly in the implementation. For illustration, consider the implementation of
a function $f : \{-1,+1\}^N \to \{-1,+1\}$ on a feedforward neural network. If we know that
$f$ is an even function, i.e., $f(x_1, x_2, \ldots, x_N) = f(-x_1, -x_2, \ldots, -x_N)$, we
can incorporate this hint in the network as shown in Figure II.2.

Each input $x_i$ is applied to two sets of neurons, with one set implementing the dual
functions of the other set. The dual of $f(x_1, x_2, \ldots, x_N)$ is defined as
$-f(-x_1, -x_2, \ldots, -x_N)$, and can be implemented by using the same weights and the
negative of the threshold used to implement $f$. The outputs of the two dual neurons are
then combined into the neurons of the next layer using a weight and its negative. From then
on, the function implemented by the network is forced to be even in the variables
$x_1, \ldots, x_N$. These constraints on the neurons of the first two layers must be taken
into consideration in the learning process when examples of $f$ are given. For instance, if
we apply gradient descent, we cannot treat all the weights as independent variables any
more.
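
The dual-neuron construction can be checked numerically on a small case. The sketch below
assumes an odd sigmoid (`tanh` is used) and models only the activation that a next-layer
neuron receives from one neuron and its dual; the names `soft_threshold`, `even_unit`, and
`is_even_on_cube` are hypothetical.

```python
import math
from itertools import product


def soft_threshold(weights, threshold, inputs):
    """Neuron output tanh(sum_i w_i * x_i - t)."""
    return math.tanh(sum(w * x for w, x in zip(weights, inputs)) - threshold)


def even_unit(weights, threshold, v, inputs):
    """Contribution to the next layer from a neuron and its dual, fed through +v and -v."""
    neuron = soft_threshold(weights, threshold, inputs)
    # Dual neuron: same weights, negated threshold. For an odd sigmoid this equals
    # -neuron(-x), which is the dual function defined in the text.
    dual = soft_threshold(weights, -threshold, inputs)
    return v * neuron - v * dual


def is_even_on_cube(weights, threshold, v, tol=1e-12):
    """Check even_unit(x) == even_unit(-x) on every vertex of {-1, +1}^n."""
    n = len(weights)
    for x in product((-1, 1), repeat=n):
        minus_x = tuple(-xi for xi in x)
        if abs(even_unit(weights, threshold, v, x)
               - even_unit(weights, threshold, v, minus_x)) > tol:
            return False
    return True


print(is_even_on_cube([0.4, -0.7, 0.2, 1.1], 0.3, 0.9))
```

Because the evenness holds for any weights, threshold, and `v`, learning can adjust them
freely as long as the tied structure (shared weights, negated threshold, paired `+v` and
`-v`) is preserved.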

Being an even function is much simpler than the invariance hints we are likely to
encounter in pattern recognition. It may not be as easy to come up with a structure
of the network that automatically guarantees a complicated invariance, such as elastic
deformation. We will develop a unified method for incorporating any invariance, not in
the structure of the network, but rather in the gradient descent method itself.

The key idea is expressing the hint itself as a set of examples. Suppose we are dealing
with a function $f : \{-1,+1\}^N \to \{-1,+1\}$, which is invariant under cyclic shift of
the input bits, e.g., $f(x_1) = f(x_2)$, where

x_1 = -1 -1 +1 +1 -1 +1 -1 +1 -1 -1 -1 -1,

x_2 = -1 -1 -1 -1 +1 +1 -1 +1 -1 +1 -1 -1.

The condition $f(x_1) = f(x_2)$ is an example of the hint as much as $f(x) = +1$ would be an
example of the function. Just like $f(x) = +1$ can be enforced as a minimization of
$(y - 1)^2$, where $y$ is the output of the network when the input is $x$,
$f(x_1) = f(x_2)$ can be enforced as a minimization of $(y_1 - y_2)^2$, where $y_1$ and
$y_2$ are the outputs of the network when the inputs are $x_1$ and $x_2$, respectively. The
gradient descent perturbation of the weights in this case would be

$$\Delta w \propto -\frac{\partial (y_1 - y_2)^2}{\partial w} = -2\,(y_1 - y_2)\left(\frac{\partial y_1}{\partial w} - \frac{\partial y_2}{\partial w}\right),$$

which is similar to the perturbation due to two examples of $f$.
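
For a single neuron, a descent step on an example of the hint looks much like a step on an
ordinary example. The sketch below uses assumptions that are not in the report: a `tanh`
sigmoid, an arbitrary step size `rate`, arbitrary starting weights, and the hypothetical
names `cyclic_shift`, `output`, `hint_error`, and `hint_step`. The two input vectors are
the ones given in the text; the second is the first rotated right by two positions.

```python
import math


def cyclic_shift(bits, k):
    """Rotate the list right by k positions."""
    k %= len(bits)
    return bits[-k:] + bits[:-k]


def output(weights, threshold, x):
    """Single soft-threshold neuron, standing in for the whole network."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)) - threshold)


def hint_error(weights, threshold, x1, x2):
    """Error (y1 - y2)^2 of one example of the invariance hint."""
    return (output(weights, threshold, x1) - output(weights, threshold, x2)) ** 2


def hint_step(weights, threshold, x1, x2, rate=0.01):
    """One gradient descent step on (y1 - y2)^2; the value of f is never used."""
    y1, y2 = output(weights, threshold, x1), output(weights, threshold, x2)
    diff = 2.0 * (y1 - y2)
    s1, s2 = 1.0 - y1 * y1, 1.0 - y2 * y2  # tanh derivatives at the two inputs
    new_weights = [w - rate * diff * (s1 * a - s2 * b)
                   for w, a, b in zip(weights, x1, x2)]
    new_threshold = threshold + rate * diff * (s1 - s2)  # threshold enters with a minus
    return new_weights, new_threshold


x1 = [-1, -1, +1, +1, -1, +1, -1, +1, -1, -1, -1, -1]
x2 = cyclic_shift(x1, 2)

weights = [0.3, -0.2, 0.1, 0.0, 0.25, -0.15, 0.05, 0.2, -0.1, 0.15, -0.05, 0.1]
threshold = 0.0
print(hint_error(weights, threshold, x1, x2))
for _ in range(200):
    weights, threshold = hint_step(weights, threshold, x1, x2)
print(hint_error(weights, threshold, x1, x2))
```

In practice such steps would be interleaved with steps on ordinary examples of the
function, since the hint alone is also satisfied by a constant output.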

The same idea is valid for all invariances and all descent techniques. It makes it
possible to incorporate in a regular algorithm for learning from examples what would
otherwise be a hard-to-implement invariance. The initial layers of the network can be
dedicated to learning the invariance (the learning counterpart of the approach of Figure
II.2).

A look at the expression for $\Delta w$ reveals that we do not need the actual value of $f$
in order to compute the perturbation of the weights due to an example of the invariance
hint. This makes it possible to generate an arbitrary number of (possibly artificial)
examples of the hint for the learning process without having to compute $f$ (or even without
the need to know which $f$ we are learning). Such a resource may be valuable if we have a
limited number of natural examples of $f$.
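
Generating artificial examples of a cyclic-shift hint takes only a few lines, because no
value of $f$ is ever needed. The sketch below is illustrative: the name `hint_examples`,
the uniform choice of bits, and the uniform choice of a nonzero shift are all assumptions.

```python
import random


def hint_examples(n_bits, count, rng):
    """Yield `count` pairs (x1, x2) where x2 is a random cyclic shift of a random x1.

    Each pair asserts f(x1) = f(x2) for any shift-invariant f, whatever f is.
    """
    for _ in range(count):
        x1 = [rng.choice((-1, +1)) for _ in range(n_bits)]
        k = rng.randrange(1, n_bits)  # nonzero shift amount
        x2 = x1[-k:] + x1[:-k]        # rotate right by k positions
        yield x1, x2


pairs = list(hint_examples(12, 5, random.Random(0)))
for x1, x2 in pairs:
    print(x1, x2)
```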

This observation can be formalized within the framework of Vapnik and Chervonenkis
(1971). If the set of candidate functions is significantly reduced by the constraint that
they must satisfy the invariance property, the number of examples of $f$ needed for the
learning process decreases accordingly (Abu-Mostafa, 1989).

II.3  COMPLEXITY  ISSUES 

The process of learning an unknown function $f$ in a feedforward network can be
considered a search in Euclidean space for a set of weights and thresholds that implements
the function. Indeed, without restricting the domain of functions and networks, the learning
problem is NP-complete (Judd, 1988). Other, apparently simplistic, learning problems
have also been shown to be NP-complete (Valiant, 1984).

Does the incorporation of hints in the learning process reduce the time complexity of
learning? It is plausible that a hierarchical decomposition of the search space that results
from learning different hints independently will reduce the search time. However, we were
unable to show a meaningful instance of a hint that rendered an NP-complete learning
problem polynomial-time. We will report on an interesting hint that did not change the
NP-completeness of the problem.

Consider the general problem of learning in a feedforward network with hard
thresholds. Assume that, as part of the input to the learning process, we are given the
signs of a set of weights that does implement the function in question. This hint is
biologically motivated. In actual neurobiological systems, certain synapses are predisposed
to be inhibitory and others to be excitatory. Only the magnitude of the weight, not the
sign, is left to the learning process.

The problem remains NP-complete as it turns out to be polynomially related to the
old problem. To see this, we replace each synapse of the old network by the subnetwork of
Figure II.3a. There are clearly choices of the weights with the prescribed signs that lead
to an arbitrary equivalent weight for the original synapse.

It is interesting to note that the problem also remains NP-complete if we are given
the absolute values of the weights and need only to find the signs! The reason is that
each synapse can be replaced by the subnetwork of Figure II.3b. The prescribed moduli of
the weights in this subnetwork can be given a pattern of signs that leads to an arbitrary
equivalent weight (of finite accuracy) for the original synapse. Since we never need more
than a polynomial number of bits for the weight (Hong, 1987), the polynomial equivalence
is established.

II.4  CONCLUSION 

Hints are pieces of information about an unknown function that we wish to learn,
ranging from input-output examples to a complete implementation of the function.
Invariance hints can be incorporated into descent methods of learning from examples, with
possible gains in information and complexity. Certain strong hints do not change the
NP-completeness status of learning.

Several  directions  of  investigating  the  subject  of  hints  remain  open.  To  name  a  few, 
the  compatibility  of  the  hints  with  the  desired  implementation  of  /  and  with  each  other 
and  how  this  affects  their  complexity  value,  the  quantification  of  the  information  value 
of  a  hint  in  terms  of  the  change  in  the  V-C  dimension,  and  finding  natural  examples  of 
NP-complete  learning  problems  that  become  polynomial-time  using  plausible  hints. 

III


AN UNSUPERVISED LEARNING TECHNIQUE
FOR ARTIFICIAL NEURAL NETWORKS

III.1 INTRODUCTION

One of the most important features of neural networks is their learning ability, which
makes them in general suitable for computational applications whose structure is relatively
unknown. Pattern recognition is one example of such problems. For pattern
recognition there are mainly two types of learning: supervised and unsupervised learning
(refer to [1]). For supervised learning the training set consists of typical patterns whose
class identities are known. For unsupervised learning, on the other hand, information
about the class membership of the training patterns is not given, either because of lack of
knowledge or because of the high cost of providing the class labels associated with each of
the training patterns.

A number of neural network models for unsupervised learning have been proposed,
for example Grossberg's adaptive resonance [2] and Kohonen's self-organizing maps [3].
In pattern recognition problems the pattern vectors tend to form clusters, a cluster for
each class, and therefore the first step for a typical unsupervised learning technique is the
estimation of these clusters. The clusters are usually separated by regions of low pattern
density. We present here a new method for unsupervised learning in neural networks
(see also [4]). The basic idea of the method is to update the weights of the neurons in
a way that moves the decision boundaries to places sparse in patterns. This is one of the
most natural ways to partition clusters. It is more or less similar to the way humans
visually decompose clusters of points in three or fewer dimensions. This contrasts with
the traditional approach of adaptive resonance and self-organizing maps, whereby
the estimation of the membership of a pattern vector depends only on, respectively, the
inner products and the Euclidean distances between this vector and suitably estimated
class representative vectors.

III.2  THE  MODEL 

Let $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$ represent the training pattern vectors. Let the dimension of the
vectors be $n$. The training patterns represent several different classes. The class identities
of the training patterns are not known. Our purpose is to estimate the number of classes
available as well as the parameters of the neural classifier. Let us first consider the case
of having two classes. Our neural classifier consists of one neuron (see Fig. III.1) with $n$
inputs corresponding to the $n$ components of the pattern vector $\mathbf{x}$ to be classified. The
output of the neuron is given by

$$y = f(\mathbf{w}^T \mathbf{x} + w_0)$$

where $\mathbf{w}$ is a weight vector, $w_0$ is a threshold and $f$ is a sigmoid function from $-1$ to $1$ with
$f(0) = 0$, e.g.

$$f(u) = \frac{1 - e^{-u}}{1 + e^{-u}}.$$


Positive  output  means  class  1,  negative  output  means  class  2  and  zero  output  means 
undecided.  The  output  can  be  construed  as  indicating  the  degree  of  membership  of  the 
pattern  to  each  of  the  two  classes;  thus  we  have  a  fuzzy  classifier. 
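
The single-neuron classifier described above can be sketched in a few lines. This is only an illustration: the function names are ours, and we assume the example sigmoid $(1 - e^{-u})/(1 + e^{-u})$, which equals $\tanh(u/2)$.

```python
import math

def sigmoid(u):
    """Sigmoid from -1 to 1 with sigmoid(0) == 0.

    (1 - e^-u) / (1 + e^-u) is algebraically tanh(u / 2); tanh avoids overflow.
    """
    return math.tanh(u / 2.0)

def neuron_output(w, w0, x):
    """First-order neuron output y = f(w^T x + w0)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w0)

def classify(y):
    """Positive output -> class 1, negative -> class 2, zero -> undecided (None)."""
    if y > 0:
        return 1
    if y < 0:
        return 2
    return None

# Hypothetical weights and pattern, chosen only for illustration.
y = neuron_output([1.0, -1.0], 0.0, [2.0, 0.5])
label = classify(y)
```

The magnitude of `y` plays the role of the degree of membership: values near 0 correspond to patterns close to the decision boundary.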

The  decision  boundary  is  the  hyperplane 


$$\mathbf{w}^T \mathbf{x} + w_0 = 0$$

(see Fig. III.2). Patterns near the decision boundary will produce outputs close to zero
while patterns far away from the boundary will give outputs close to 1 or -1. Assuming
$\|\mathbf{w}\| \le a$, where $a$ is a constant, a good classifier is one which produces an output close
to 1 or -1 for the training patterns, since this indicates its "decisiveness" with respect to
the  class  memberships.  Therefore,  we  design  the  classifier  so  as  to  maximize  the  criterion 
function: 

$$J = \frac{1}{N}\sum_j f^2(\mathbf{w}^T\mathbf{x}^{(j)} + w_0) - \Big[\frac{1}{N}\sum_j f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0)\Big]^{2q}$$

subject to $\|\mathbf{w}\| \le a$, where $N$ is the number of training patterns and $q$ is a positive integer
(for best results we take $q=2$). The
first term is a measure of the decisiveness of the classifier with the given test pattern set.
The second term has the following purpose. The first term is maximized if
the decision hyperplane is very far away from the patterns, resulting in the output being
very close to 1 for all test patterns (or close to -1 for all patterns), i.e. all patterns are
assigned to one class only, which is a trivial solution. Incorporating the second term will
prevent this trivial solution because if $f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0) \approx 1$ for all $j$ (or $\approx -1$ for all $j$), then $J$
will be nearly zero. However, $J$ is non-negative for any $\mathbf{w}$ and $w_0$ because of the following:

$$J \ge \frac{1}{N}\sum_j f^2(\mathbf{w}^T\mathbf{x}^{(j)} + w_0) - \bar{f}^{\,2},$$

where

$$\bar{f} = \frac{1}{N}\sum_j f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0).$$

The inequality follows because $\bar{f}$, the average of the outputs, is less than one in magnitude
and $q \ge 1$. Then we get

$$J \ge \frac{1}{N}\sum_j \big[f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0) - \bar{f}\,\big]^2 \ge 0$$


and  hence  the  trivial  solution  of  assigning  all  patterns  to  one  class  only  is  ruled  out  because 
it  corresponds  to  a  minimum  not  a  maximum  for  J. 
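
The behaviour of the criterion can be checked numerically with a short sketch. We assume here that the criterion reads $J = \frac{1}{N}\sum_j f_j^2 - \bar{f}^{\,2q}$ with $f_j$ the neuron output for pattern $j$; the helper names and the toy one-dimensional data are ours, not the report's.

```python
import math

def f(u):
    """Sigmoid from -1 to 1 with f(0) = 0; equals (1 - e^-u) / (1 + e^-u)."""
    return math.tanh(u / 2.0)

def outputs(w, w0, patterns):
    """Neuron outputs f(w^T x + w0) for every pattern x."""
    return [f(sum(wi * xi for wi, xi in zip(w, x)) + w0) for x in patterns]

def criterion(w, w0, patterns, q=2):
    """Assumed form of the criterion: J = mean(f^2) - (mean f)^(2q)."""
    ys = outputs(w, w0, patterns)
    mean = sum(ys) / len(ys)
    return sum(y * y for y in ys) / len(ys) - mean ** (2 * q)

def output_variance(w, w0, patterns):
    """Lower bound on J from the text: mean((f - mean f)^2)."""
    ys = outputs(w, w0, patterns)
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

# Two 1-D clusters around -3 and +3 (illustrative data only).
patterns = [[-3.0], [-2.5], [2.5], [3.0]]
j_split = criterion([1.0], 0.0, patterns)     # boundary between the clusters
j_trivial = criterion([1.0], 50.0, patterns)  # boundary far from all patterns
```

With the boundary between the clusters `j_split` is well above zero, whereas the trivial one-class solution gives `j_trivial` essentially equal to zero, in line with the argument above.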

We iterate the weights in a steepest ascent manner in an attempt to maximize the
defined measure,

$$\Delta\mathbf{w} = \rho\,\frac{\partial J}{\partial \mathbf{w}} = \frac{2\rho}{N}\sum_j \big[f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0) - q\bar{f}^{\,2q-1}\big]\, f'(\mathbf{w}^T\mathbf{x}^{(j)} + w_0)\,\mathbf{x}^{(j)},$$

$$\Delta w_0 = \rho\,\frac{\partial J}{\partial w_0} = \frac{2\rho}{N}\sum_j \big[f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0) - q\bar{f}^{\,2q-1}\big]\, f'(\mathbf{w}^T\mathbf{x}^{(j)} + w_0),$$

where $\rho$ is the step size. The iteration is subject to the condition $\|\mathbf{w}\| \le a$. Whenever
after some iteration this condition is violated, $\mathbf{w}$ is projected back to the surface of the
hypersphere $\|\mathbf{w}\| = a$ (simply by multiplying by $a/\|\mathbf{w}\|$). The term $f'(\mathbf{w}^T\mathbf{x}^{(j)} + w_0)$ in
the update expression gives the patterns near the decision boundary more effect than the
patterns far away from the decision boundary. This is because $f'(u)$ is high whenever $u$
is near zero and goes to zero when the magnitude of $u$ is large. Thus, what the method
essentially does is iterate the weights in a way that moves the decision boundary to a
place sparse in patterns. The relative importance of patterns near the decision boundary
increases for large $a$. This is because $\mathbf{w}$ tends to go to the surface of the sphere $\|\mathbf{w}\| \le a$.
Large $a$ will therefore result in the argument of $f'$ being in general higher in magnitude than for the case
of small $a$, thus resulting in a smaller strip with high $f'$ around the decision boundary.
Of course, without the constraint $\mathbf{w}$ could grow in an unbounded fashion, resulting in
$f(\mathbf{w}^T\mathbf{x}^{(j)} + w_0)$ being very near 1 or $-1$ for all $j$ for most possible partitions, and we
therefore possibly obtain bad solutions.
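
A minimal sketch of the projected steepest-ascent iteration follows. It assumes the update is the gradient of $J = \frac{1}{N}\sum_j f_j^2 - \bar{f}^{\,2q}$ with $f(u) = \tanh(u/2)$; the values of `a`, `rho`, and `iters`, the starting weights, and the toy data are arbitrary choices of ours, not values from the report.

```python
import math

def f(u):
    """Sigmoid from -1 to 1 with f(0) = 0."""
    return math.tanh(u / 2.0)

def f_prime(u):
    """Derivative of tanh(u / 2): (1 - f(u)^2) / 2."""
    return 0.5 * (1.0 - f(u) ** 2)

def train(patterns, w, w0, a=2.0, rho=0.5, q=2, iters=200):
    """Steepest ascent on J, projecting w back onto ||w|| = a when ||w|| > a."""
    n = len(patterns)
    for _ in range(iters):
        us = [sum(wi * xi for wi, xi in zip(w, x)) + w0 for x in patterns]
        fs = [f(u) for u in us]
        f_bar = sum(fs) / n
        # Per-pattern factor [f_j - q * f_bar^(2q-1)] * f'(u_j).
        g = [(fj - q * f_bar ** (2 * q - 1)) * f_prime(u) for fj, u in zip(fs, us)]
        w = [wi + 2.0 * rho / n * sum(gj * x[i] for gj, x in zip(g, patterns))
             for i, wi in enumerate(w)]
        w0 = w0 + 2.0 * rho / n * sum(g)
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > a:  # constraint violated: rescale w by a / ||w||
            w = [wi * a / norm for wi in w]
    return w, w0

# Two 1-D clusters (illustrative data only), small arbitrary starting weights.
data = [[-3.0], [-2.5], [-2.0], [2.0], [2.5], [3.0]]
w, w0 = train(data, [0.1], 0.3)
signs = [f(w[0] * x[0] + w0) > 0 for x in data]
```

On this toy set the decision boundary $-w_0/w$ settles inside the empty gap between the two clusters, so `signs` separates the left cluster from the right one.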

III.3 EXTENSIONS


We  have  shown  in  the  previous  section  a  method  for  training  a  linear  classifier.  To 
extend  the  analysis  to  the  case  of  a  curved  decision  boundary  we  use  a  higher  order  neuron, 
i.e.  the  output  is  described  by 

$$y = f\Big(w_0 + \sum_i w_i x_i + \cdots + \sum_{i_1 \le \cdots \le i_L} w_{i_1 \cdots i_L}\, x_{i_1} \cdots x_{i_L}\Big)$$

where $x_i$ denotes the $i$th component of the vector $\mathbf{x}$ and $L$ represents the order. Higher
order  networks  have  been  investigated  by  several  researchers,  refer  for  example  to  [5], [6] 
and  [7].  They  achieve  simplicity  of  design  while  still  having  the  capability  of  producing 
fairly  sophisticated  non-linear  decision  boundaries. 

The  decision  boundary  for  the  higher  order  neuron  is  given  by  the  equation 
$$w_0 + \sum_i w_i x_i + \cdots + \sum_{i_1 \le \cdots \le i_L} w_{i_1 \cdots i_L}\, x_{i_1} \cdots x_{i_L} = 0.$$

The higher order case is essentially a linear classifier applied to augmented vectors of the
form

$$(x_1, \ldots, x_n, \ldots, x_1^L, x_1^{L-1}x_2, \ldots, x_n^L)$$

(the superscripts denote here powers) using an augmented weight vector

$$(w_1, \ldots, w_n, \ldots, w_{1\cdots 1}, w_{1\cdots 12}, \ldots, w_{n\cdots n}).$$
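
The augmented vector can be generated mechanically by enumerating all index tuples $i_1 \le \cdots \le i_k$ for $k = 1, \ldots, L$. The sketch below is one way to do this; the function name `augment` and the ordering of the monomials are our own choices.

```python
from itertools import combinations_with_replacement
from math import prod

def augment(x, order):
    """Return all monomials x[i1] * ... * x[ik] with i1 <= ... <= ik, k = 1..order.

    The threshold w0 is not included; it stays a separate parameter.
    """
    feats = []
    for k in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            feats.append(prod(x[i] for i in idx))
    return feats

# Second-order augmentation of a 2-D pattern: (x1, x2, x1^2, x1*x2, x2^2).
z = augment([2.0, 3.0], 2)
```

A first-order neuron trained on `augment(x, L)` then behaves as an order-$L$ neuron on `x`. Note that the number of weights grows quickly with $n$ and $L$: for $n = 2$ and $L = 3$ there are already 9 monomials.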


We  can  therefore  apply  basically  the  same  training  procedure  described  for  the  linear 
classifier  case  on  the  set  of  augmented  vectors,  i.e.  we  update  the  weights  according  to 

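$$\Delta w_0 = \frac{2\rho}{N}\sum_j \big[f(u_j) - q\bar{f}^{\,2q-1}\big]\, f'(u_j),$$

$$\Delta w_{i_1 \cdots i_k} = \frac{2\rho}{N}\sum_j \big[f(u_j) - q\bar{f}^{\,2q-1}\big]\, f'(u_j)\, x^{(j)}_{i_1} \cdots x^{(j)}_{i_k},$$

where $f(u_j)$ is the output of the neuron when applying the pattern $\mathbf{x}^{(j)}$, i.e.

$$u_j = w_0 + \sum_i w_i x^{(j)}_i + \cdots + \sum_{i_1 \le \cdots \le i_L} w_{i_1 \cdots i_L}\, x^{(j)}_{i_1} \cdots x^{(j)}_{i_L}$$

and

$$\bar{f} = \frac{1}{N}\sum_j f(u_j).$$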




We have considered so far the two-class case. The extension to the multi-class case is
as follows. We apply the previously described procedure (preferably the curved boundary
one), i.e. partition the patterns into two groups. One group of patterns $S^{+}$ corresponds to
positive outputs and represents an estimate of a collection of classes. The other group $S^{-}$
corresponds to negative outputs and represents an estimate of the remaining classes. Then
we consider the group $S^{+}$ separately (i.e. we use only the patterns in $S^{+}$ for training), and
partition it further into two groups $S^{++}$ and $S^{+-}$ corresponding to positive and negative
outputs respectively. Similarly we partition $S^{-}$ into $S^{-+}$ and $S^{--}$. We continue in this
hierarchical manner until we end up with the final classifier; see Fig. III.3 for an example.
We stop the partitioning of a group whenever the criterion function $J$ associated with
the partition falls below a certain threshold. (One practically observes that $J$ in general
decreases slightly after each partition until one cluster remains whose partitioning results
in a relatively abrupt decrease of $J$.) The final classifier now consists of several neurons,
the sign pattern of whose outputs is an encoding of the class estimate of the presented
pattern. Note that the output of the neuron responsible for breaking up some group plays
a role in the encoding of only the classes contained in that group. Therefore, the encoding
of a class could have a number of don't cares. Refer to the next section for a constructive
example of the extension to the multi-class case.
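
The recursive structure of the hierarchical procedure can be sketched as follows. To keep the example short and deterministic, the two-way split is NOT the neuron training described above: we substitute a stand-in splitter for one-dimensional points that cuts at the widest gap and uses the gap width in place of $J$. Only the recursion, the stopping rule, and the sign encoding are meant to mirror the text; all names are hypothetical.

```python
def largest_gap_split(points):
    """Stand-in two-way splitter for 1-D points (not the neuron of the text).

    Returns (score, cut): the widest gap between sorted neighbours and its midpoint.
    The score plays the role of the criterion J.
    """
    pts = sorted(points)
    gaps = [(b - a, (a + b) / 2.0) for a, b in zip(pts, pts[1:])]
    return max(gaps) if gaps else (0.0, None)

def partition(points, split, threshold, code=""):
    """Recursively split a group; return {sign_code: points} for the final groups."""
    score, cut = split(points)
    if cut is None or score < threshold:
        # Criterion below threshold: stop partitioning this group.
        return {code: sorted(points)}
    plus = [p for p in points if p > cut]    # "positive output" side
    minus = [p for p in points if p <= cut]  # "negative output" side
    leaves = partition(plus, split, threshold, code + "+")
    leaves.update(partition(minus, split, threshold, code + "-"))
    return leaves

# Three 1-D clusters (illustrative data only).
data = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0, 9.2]
leaves = partition(data, largest_gap_split, threshold=1.0)
```

Each key of `leaves` is the sign path that leads to a final group, e.g. `"+-"` for the group obtained by taking the positive side of the first split and the negative side of the second. Splits further down other branches do not appear in a group's code, which is where the don't cares come from.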

III.4 IMPLEMENTATION EXAMPLES

The new method is implemented on two two-class problems and a five-class problem.
In all examples the patterns of each class are generated from bivariate Gaussian distributions.
The first example is a two-class problem with a large overlap between the two
classes. Fig. III.4 shows the resulting decision boundary when applying the new method
using a first-order neuron. One observes that the obtained partition agrees to a large extent
with what a human would estimate when attempting to decompose the two clusters visually.
Regarding the second example, we have two classes with equal diagonal covariance matrices
whose diagonal entries differ greatly, resulting in two long clusters. Fig. III.5 shows the
results of the application of the proposed method using a first-order neuron. In the last
example we have a five-class problem with the means being on the corners and the center
of a square. We used third-order neurons. Fig. III.6 shows the results. We ended up with
four neurons partitioning the patterns. The first neuron is responsible for partitioning the
whole collection of patterns into the two groups $S^{+}$ and $S^{-}$, separated by the boundary
$B_1$; the second neuron partitions the group $S^{+}$ into the groups $S^{++}$ and $S^{+-}$, separated
by the boundary $B_2$; the third neuron divides $S^{++}$ into $S^{+++}$ and $S^{++-}$ with boundary
$B_3$; finally the fourth neuron partitions $S^{-}$ into $S^{-+}$ and $S^{--}$, the boundary being
$B_4$ (see also Fig. III.3). One observes that the method partitioned the patterns successfully.
As a pattern classifier, the four neurons give the encoding of the class estimate of the
presented pattern. The codes of the five classes are +++X, ++-X, +-XX, -XX+,
and -XX-, where the +'s and -'s represent the signs of the outputs and the X means
don't care.
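
Decoding a class from the four neuron outputs amounts to matching their sign pattern against codes with don't-care positions. The sketch below assumes the five codes read +++X, ++-X, +-XX, -XX+, and -XX-; the class numbering 1 to 5 and the function name are our own labels.

```python
# Sign codes of the five classes; "X" marks a don't-care position.
# The numbering 1..5 is arbitrary and used only for illustration.
CODES = {"+++X": 1, "++-X": 2, "+-XX": 3, "-XX+": 4, "-XX-": 5}

def decode(outputs, codes=CODES):
    """Map the outputs of the four neurons to a class label, or None if undecided."""
    # A zero output is "undecided" and matches neither "+" nor "-".
    signs = "".join("+" if y > 0 else "-" if y < 0 else "0" for y in outputs)
    for pattern, label in codes.items():
        if all(p == "X" or p == s for p, s in zip(pattern, signs)):
            return label
    return None

label = decode([0.9, 0.8, -0.7, 0.3])  # signs + + - +, fourth neuron ignored
```

Because the codes come from a binary tree, at most one of them can match any given sign pattern, so the order in which the codes are tested does not matter.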

III.5 CONCLUSIONS

A new neural unsupervised learning technique has been proposed in this work. This
technique is based on the hierarchical partition of the patterns. Each partition corresponds
to one neuron, which is in general a higher-order neuron. The partition is performed by
iterating the neuron weights in an attempt to maximize a defined criterion function. The
method is implemented on several examples and is found to give good results. In the
second implemented example (Fig. III.5) the method obtained a good solution, whereas
the traditional adaptive resonance method [2] and self-organizing maps [3] produced
unsatisfactory results. The method is fast, as it typically takes about 2 to 5 iterations
to converge. Although the proposed method is prone to get stuck in local minima, this
happened in our simulations only in very difficult problems, and this problem could be solved
by using gradient algorithms for searching for the global maximum, like the Tunneling
Algorithm [8].

Fig. III.1: The model of a first order neuron with a sigmoid-shaped function.

Fig. III.2: The decision regions and the decision boundary of a first order neuron example.

[Figure III.3 diagram: tree rooted at "all patterns" with leaves labeled by class, e.g. "class 1" and "class 2".]

Fig. III.3: An example of the hierarchical design of the classifier.


Fig.  III.4:  A  two-cluster  example  with  the  decision  boundary  of  the  designed  classifier. 

Fig. III.6: A five-class example showing the decision boundaries of the hierarchically
designed classifier. There are four neurons associated with the boundaries $B_1$, $B_2$, $B_3$, and
$B_4$.